Compare concat(), merge(), and join() in Pandas

Table 1: Employees
Employee ID	Name	Department
101	Fatimah Al-Zahrani	Finance
102	Ahmed Al-Zahrani	Human Resources
103	Mohammed Al-Zahrani	IT
104	Sheikha Al-Zahrani	Marketing

Table 2: Salaries
Employee ID	Salary	Bonus
101	5000	500
102	6000	400
103	7000	600
105	8000	700

Purpose: The concat() function is used to concatenate two or more DataFrames along a specific axis, either rows or columns. It is commonly used for combining datasets when you want to stack or add new rows/columns.

Key Features:

Works with both DataFrames and Series.
Allows concatenation along rows (axis=0) or columns (axis=1).
Handles duplicate indices by default, but you can reset or ignore them using the ignore_index parameter.

Limitations:

Requires matching data structure.
Does not support merging on columns.

Example Code:

 
import pandas as pd

# DataFrames based on the example tables
df_employees = pd.DataFrame({
    'Employee ID': [101, 102, 103, 104],
    'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
    'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})

df_salaries = pd.DataFrame({
    'Employee ID': [101, 102, 103, 105],
    'Salary': [5000, 6000, 7000, 8000],
    'Bonus': [500, 400, 600, 700]
})

# Concatenate along rows (axis=0)
result = pd.concat([df_employees, df_salaries], ignore_index=True)

print(result)

Output :

    Employee ID                Name          Department  Salary  Bonus
0          101  Fatimah Al-Zahrani             Finance     NaN    NaN
1          102    Ahmed Al-Zahrani     Human Resources     NaN    NaN
2          103  Mohammed Al-Zahrani                  IT     NaN    NaN
3          104   Sheikha Al-Zahrani          Marketing     NaN    NaN
4          101                  NaN                NaN   5000    500
5          102                  NaN                NaN   6000    400
6          103                  NaN                NaN   7000    600
7          105                  NaN                NaN   8000    700

Explanation: In this example, the concat() function is used to combine the two tables, Employees and Salaries. Since they have different columns, NaN values are inserted where data is missing. This demonstrates how concatenation works by stacking the rows of one DataFrame below the other.

Key Features:

Supports various join types: inner, outer, left, and right.
Merges can be based on one or multiple columns or indices.
Customizable suffixes to handle overlapping column names.

Limitations:

Slower compared to concat() in some cases.

Example Code:

 
import pandas as pd
# DataFrames based on the example tables
df_employees = pd.DataFrame({
    'Employee ID': [101, 102, 103, 104],
    'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
    'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})

df_salaries = pd.DataFrame({
    'Employee ID': [101, 102, 103, 105],
    'Salary': [5000, 6000, 7000, 8000],
    'Bonus': [500, 400, 600, 700]
})

# Merge the DataFrames on 'Employee ID' column (inner join by default)
result_inner = pd.merge(df_employees, df_salaries, on='Employee ID')

# Merge the DataFrames with a left join
result_left = pd.merge(df_employees, df_salaries, on='Employee ID', how='left')

print("Inner Join Result:")
print(result_inner)

print("\nLeft Join Result:")
print(result_left)

Output :(Inner Join)

   Employee ID                Name          Department  Salary  Bonus
0          101  Fatimah Al-Zahrani             Finance   5000    500
1          102    Ahmed Al-Zahrani     Human Resources   6000    400
2          103  Mohammed Al-Zahrani                  IT   7000    600

Output :(Left Join)

   Employee ID                Name          Department  Salary  Bonus
0          101  Fatimah Al-Zahrani             Finance   5000  500.0
1          102    Ahmed Al-Zahrani     Human Resources   6000  400.0
2          103  Mohammed Al-Zahrani                  IT   7000  600.0
3          104   Sheikha Al-Zahrani          Marketing    NaN    NaN

Explanation:

In the inner join, only rows with matching Employee ID values in both DataFrames are included. Thus, the row for Employee ID 105 from df_salaries and Employee ID 104 from df_employees are excluded.
In the left join, all rows from the left DataFrame (df_employees) are retained, even if there's no match in the right DataFrame (df_salaries). Missing values in the right DataFrame are filled with NaN.
This example illustrates how merging is used to combine datasets with related but not identical information.

Purpose: The join() function is used to combine two DataFrames based on their indices or a specified key column. It is particularly useful for index-based merges and simplifies the process when the indices of both DataFrames are aligned or need alignment.

Key Features:

Default behavior joins DataFrames on their indices.
Supports different join types: left, right, outer, and inner.
Customizable suffixes for overlapping column names.

Limitations:

Primarily works on the index and may require customization for column-based merging.

Example Code:

 
import pandas as pd

# Create two DataFrames: one with employee details and another with salary information
df_employees = pd.DataFrame({
    'Employee ID': [101, 102, 103, 104],
    'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
    'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})

df_salaries = pd.DataFrame({
    'Employee ID': [101, 102, 103, 105],
    'Salary': [5000, 6000, 7000, 8000],
    'Bonus': [500, 400, 600, 700]
})

# Join the DataFrames on 'Employee ID'
result = df_employees.set_index('Employee ID').join(df_salaries.set_index('Employee ID'))

print(result)

Output :

                    Name            Department  Salary  Bonus
Employee ID                                           
101      Fatimah Al-Zahrani       Finance    5000    500
102      Ahmed Al-Zahrani   Human Resources    6000    400
103  Mohammed Al-Zahrani           IT    7000    600
104      Sheikha Al-Zahrani   Marketing    NaN     NaN

Explanation:

The join() function is used here to combine the df_employees and df_salaries DataFrames using their common column Employee ID as the index. Both DataFrames are first set to have Employee ID as their index using set_index().


          The rows from the df_employees DataFrame are retained, while the corresponding salary and
            bonus data from the df_salaries DataFrame is added based on matching Employee ID.
          
          For Employee ID 104, there is no corresponding entry in df_salaries, so the
            resulting columns for Salary and Bonus are NaN for that row.

Feature	`concat()`	`merge()`	`join()`
Purpose	Merges DataFrames along a specified axis (row-wise or column-wise).	Performs SQL-like joins based on common columns or indices.	Joins DataFrames using their index.
Primary Use Case	Stacking or appending DataFrames.	Combining datasets with key-based relationships.	Efficiently joining on index without manual key matching.
Default Behavior	Concatenates along rows (`axis=0`).	Performs an `inner` join by default.	Performs a `left` join by default.
Join Type Supported	Not applicable (direct concatenation).	`inner`, `outer`, `left`, `right`.	`inner`, `outer`, `left`, `right`.
Index Handling	Keeps original index unless `ignore_index=True` is used.	Can merge on index or column with `on`, `left_on`, `right_on`.	Uses index by default but can specify a column with `on=`.
Duplicates Handling	Retains all rows unless manually removed.	Keeps duplicate rows unless handled separately.	Duplicates remain unless manually handled.
NaN Handling	Preserves NaN values unless `fillna()` is used.	Introduces NaNs for non-matching keys in `outer` joins.	Introduces NaNs for missing index matches unless `how='inner'`.
MultiIndex Support	Yes, supports MultiIndex concatenation.	Yes, supports MultiIndex merging.	Yes, primarily for index-based operations.
Performance	Faster for simple stacking operations.	Slower due to key-based lookups.	Faster than `merge()` for index-based joins.
Example Usage	`pd.concat([df1, df2], axis=0)`	`df1.merge(df2, on='key', how='inner')`	`df1.join(df2, how='left')`

Comparing concat(), merge(), and join() in Pandas

Tables Used in the Examples

1. The `concat()` Function

2. The `merge()` Function

3. The `join()` Function

Summary Comparison Between `concat()`, `merge()`, and `join()`

Test Your Knowledge

Tables Used in the Examples

1. The concat() Function

2. The merge() Function

3. The join() Function

Summary Comparison Between concat(), merge(), and join()

Test Your Knowledge

1. The `concat()` Function

2. The `merge()` Function

3. The `join()` Function

Summary Comparison Between `concat()`, `merge()`, and `join()`